feat(hogql): cohort left join conjoined #19725
# Conflicts: # mypy-baseline.txt # posthog/hogql_queries/insights/trends/breakdown.py # posthog/hogql_queries/insights/trends/trends_query_runner.py
jumping to a call, dumping thoughts
if (isinstance(arg.value, int) or isinstance(arg.value, float)) and not isinstance(arg.value, bool):
    cohorts = Cohort.objects.filter(id=int(arg.value), team_id=context.team_id).values_list(
        "id", "is_static", "name"
    )
Why the float? It seems like some fields in the schema didn't come with /** @asType integer */
The whole `breakdown` value type union is pretty messed up. Would love to break it down some more into multiple fields with better typing. I've had to use `# type: ignore` in too many places due to this.
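For context on the `bool` exclusion in the check above: in Python, `bool` is a subclass of `int`, so `isinstance(True, int)` is true. A minimal sketch of the same check as a standalone helper (the name `is_numeric_cohort_id` is hypothetical and not part of this PR):

```python
def is_numeric_cohort_id(value: object) -> bool:
    """Return True for int or float values, excluding bools.

    Hypothetical helper mirroring the isinstance check in the diff.
    """
    # bool is a subclass of int, so it has to be excluded explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)
```

Note that `int(arg.value)` truncates a fractional float such as `1.5` to `1`, which is one more reason tightening the schema type to integer may be worthwhile.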
if not isinstance(arg, ast.Constant):
    raise HogQLException("IN COHORT only works with constant arguments", node=arg)

if (isinstance(arg.value, int) or isinstance(arg.value, float)) and not isinstance(arg.value, bool):
same float question here
"""
    SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id
    FROM static_cohort_people
    WHERE {cohort_clause}
    UNION ALL
    SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id
    FROM raw_cohort_people
    WHERE {cohort_clause}
    GROUP BY cohort_person_id, cohort_id, version
    HAVING sum(sign) > 0
""",
Should we figure out which ones are static and which are dynamic, and remove one side of this union when needed? I think this could save anywhere between a bit and a bunch of CPU cycles.
Yeah, we can check if we're querying just static, just dynamic, or both for sure
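One way the suggestion could look, as a rough sketch only: decide which sides of the union are needed from the `is_static` flag of each cohort, and emit only those. The function name `build_cohort_union` and the module-level constants are hypothetical; the tuple shape `(id, is_static, name)` follows the `values_list` call in the diff, and the `{cohort_clause}` placeholder is left in place for the caller to fill.

```python
from typing import List, Tuple

# The two sides of the union from the diff, kept as separate templates.
STATIC_SIDE = """SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id
FROM static_cohort_people
WHERE {cohort_clause}"""

DYNAMIC_SIDE = """SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id
FROM raw_cohort_people
WHERE {cohort_clause}
GROUP BY cohort_person_id, cohort_id, version
HAVING sum(sign) > 0"""


def build_cohort_union(cohorts: List[Tuple[int, bool, str]]) -> str:
    """Return a query template containing only the sides that are needed.

    Hypothetical helper: `cohorts` holds (id, is_static, name) tuples.
    """
    if not cohorts:
        raise ValueError("at least one cohort is required")

    sides = []
    if any(is_static for _id, is_static, _name in cohorts):
        sides.append(STATIC_SIDE)
    if any(not is_static for _id, is_static, _name in cohorts):
        sides.append(DYNAMIC_SIDE)
    return "\nUNION ALL\n".join(sides)
```

With only dynamic cohorts, for example `build_cohort_union([(1, False, "a")])`, the result never touches `static_cohort_people`; with a mix of both kinds, the full `UNION ALL` is kept.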
I think this makes sense. I see now as well why multiple joins with an OR wouldn't have really worked. Thinking now, there probably would have been a way to get it working with array joins, but it'd then still make multiple queries to the same cohorts tables, and possibly cause other issues with other aggregations. So LGTM, but I'd still see if there's something to do about the int/float typing. Can probably be later as well.
Edit: also there are some failing tests.
FROM cohortpeople
WHERE and(equals(cohortpeople.team_id, 420), in(cohortpeople.cohort_id, [1]))
GROUP BY cohort_person_id, cohortpeople.cohort_id, cohortpeople.version
HAVING ifNull(greater(sum(cohortpeople.sign), 0), 0)) AS __in_cohort ON equals(__in_cohort.cohort_person_id, events.person_id)
flyby: cohortpeople.sign is deprecated and shouldn't be used; it doesn't work properly here. Instead, we check whether the row's version is equal to the version on the cohort.
I see sign is still used in a few places for cohort calculations, but I think those places don't face this same issue 🙈 (for various reasons). In querying though, stick to version!
This means somehow passing the current cohort version value to the query generator
basically, the problem here is some stale versions of the cohorts might have sum(sign) > 0 as well, which can then lead to extra persons being filtered, iirc.
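To make the stale-version problem concrete, here is a small in-memory simulation. The rows are invented for illustration and do not come from any real table; each row is `(person_id, cohort_id, version, sign)`, and the two functions are hypothetical stand-ins for the `sum(sign) > 0` filter and the version filter.

```python
from collections import defaultdict

# Hypothetical rows: (person_id, cohort_id, version, sign).
# Person 2 matched cohort 10 in version 1 but not in version 2.
rows = [
    (1, 10, 1, 1),
    (2, 10, 1, 1),
    (1, 10, 2, 1),
]
# Hypothetical current version per cohort.
current_version = {10: 2}


def members_by_sign(rows):
    """Mimic GROUP BY person, cohort, version HAVING sum(sign) > 0."""
    totals = defaultdict(int)
    for person_id, cohort_id, version, sign in rows:
        totals[(person_id, cohort_id, version)] += sign
    return {(p, c) for (p, c, _v), total in totals.items() if total > 0}


def members_by_version(rows, current_version):
    """Keep only rows whose version equals the cohort's current version."""
    return {(p, c) for p, c, v, _s in rows if current_version.get(c) == v}
```

In this toy data the sign-based filter still reports person 2 as a member because the stale version 1 row has a positive sum, while the version-based filter returns only person 1.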
@neilkakkar Cool, so, I need to grab `cohort.version` and filter `cohort_people` + removing the `sign` clause? 👌
Yep, should work well!
This means you can get rid of the group by too, which should speed things up a lot, just copy this query:
https://github.com/PostHog/posthog/blob/master/posthog/models/cohort/sql.py#L75
Dreamy, have added this in, but need to remove the group by. Thanks!
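A rough sketch of what the version-based dynamic side might look like once the `GROUP BY` and `HAVING` are gone. This is not the query from the linked `sql.py`, and the function name `build_dynamic_cohort_query` is hypothetical; it assumes the caller has already looked up each cohort's current version. Real code would more likely build HogQL AST nodes or use placeholders rather than string formatting.

```python
from typing import List, Tuple


def build_dynamic_cohort_query(cohort_versions: List[Tuple[int, int]]) -> str:
    """Build the dynamic-cohort side filtered by (cohort_id, version) pairs.

    Hypothetical helper: `cohort_versions` holds (cohort_id, current_version).
    """
    if not cohort_versions:
        raise ValueError("at least one (cohort_id, version) pair is required")

    # int() guards against non-numeric input in this string-based sketch.
    pairs = " OR ".join(
        f"(cohort_id = {int(cohort_id)} AND version = {int(version)})"
        for cohort_id, version in cohort_versions
    )
    return (
        "SELECT person_id AS cohort_person_id, 1 AS matched, cohort_id\n"
        "FROM raw_cohort_people\n"
        f"WHERE {pairs}"
    )
```

How to treat a cohort that has no version yet is left out of this sketch.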
Problem
subquery cohort method.
Changes
leftjoin_conjoined. Built very similar to the leftjoin modifier, but instead of having a LEFT JOIN for each cohort breakdown, we instead have a single LEFT JOIN for all cohorts.
How did you test this code?